Description

MovieMagic, a movie chain in the southwest region, is considering ways to increase spending on concessions. It has collected information on 2000 of its customers, some of whom are part of its loyalty program and some who are not. It has information on the following 8 variables, which it plans to use as predictors. It plans to use amount_spent (i.e. the amount spent on concessions) as the outcome variable, since it has learned from observation that much of its profit is derived from concession sales.

Predictors

  • age = age of the customer
  • job = type of job (e.g. management, technician, entrepreneur)
  • marital = married, single, divorced
  • education = primary, secondary, tertiary
  • seen_alone = whether the movie was seen alone or with some others (yes/no)
  • discount = whether they were sent a discount coupon (yes/no)
  • days_member = days since member of MovieMagic
  • movies_seen = number of movies seen last time period

Outcome

amount_spent = amount spent on concessions

Their objective is to find out what factors can increase concession sales and how they can improve their prediction of the outcome variable so that they can plan better.

Along with amount_spent, MovieMagic was also able to collect information from about 150 of its existing customers in the form of reviews. They feel that this text data can provide a different insight into what customers think and feel about MovieMagic.

They realize that their objective has two components: interpretation and prediction. Hence, they decide to run 3 different types of analysis:

  1. Linear regression
  2. Support Vector Regression (SVR)
  3. Text analysis

When the project henceforth mentions the 3 analyses, it refers to the 3 listed above.

Consider that you have been asked to run the analysis and answer the questions MovieMagic wants answered.


Analysis

Libraries + Data

# EDA + general
library(skimr)
library(tidyverse)
library(psych)
library(scales)
library(knitr)
library(kableExtra) # fancy tables
library(rminer)
library(effects)
library(car)


# model training
library(caret)
library(elasticnet)
library(glmnet)
library(ROCR)
library(e1071) # to run svm

# nlp
library(quanteda)
library(seededlda)
library(topicmodels)
library(RTextTools)
library(wordcloud)
library(tm)




# Read in the data
purchases <- read.csv("http://data.mishra.us/files/project_data.csv")
reviews <- read.csv("http://data.mishra.us/files/project_reviews.csv")

Descriptive Analytics

EDA: Purchasing dataset

This dataset has five categorical variables and four numeric variables, with no missing data in any observation. The majority of movie viewers:

  • are married
  • have a secondary education
  • did not see the movie alone
  • used a discount

Ages range from 21 to 61, with a slight right skew (mean 40.5 vs. median 39). On average, a customer sees 1.9 movies per period, and the median concession spend is $216.

At first glance at a pairs panel, it appears ‘experienced’ movie-goers, such as those who’ve seen a lot of movies at the theater or who’ve been members for a while, are spending less on concessions. ‘Experienced’ movie-goers may be developing traditions or habits, such as going to dinner beforehand, or ‘tricks’ like bringing in their own food. When looking at just movies_seen ~ amount_spent, we find the outliers are mostly coming from those who are seeing 3 or fewer movies.

Those who saw the movie alone appear to spend less, though this may be misleading as we don’t know the size of their group. I can only imagine those who are spending over $10k on concessions are hosting parties or large groups at the theater. It might be worth removing them as outliers, since this group would require a whole different marketing strategy catered toward coordinating groups.
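As a hedged sketch of how those extreme spenders could be flagged, a simple Tukey (IQR) fence works; the `spent` vector below holds hypothetical values, and on the real data it would be `purchases$amount_spent`:

```r
# Illustrative IQR fence for flagging extreme concession spenders.
# `spent` is a hypothetical stand-in for purchases$amount_spent.
spent <- c(0, 66, 216, 510, 980, 10350)
fence <- quantile(spent, 0.75) + 1.5 * IQR(spent)  # upper Tukey fence
outliers <- spent[spent > fence]
outliers  # only the $10,350 spender exceeds the fence here
```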

skim(purchases)
Data summary
Name purchases
Number of rows 2000
Number of columns 9
_______________________
Column type frequency:
factor 5
numeric 4
________________________
Group variables None

Variable type: factor

skim_variable n_missing complete_rate ordered n_unique top_counts
job 0 1 FALSE 12 blu: 675, tec: 286, man: 278, adm: 252
marital 0 1 FALSE 3 mar: 1208, sin: 520, div: 272
education 0 1 FALSE 4 sec: 1101, pri: 394, ter: 377, unk: 128
seen_alone 0 1 FALSE 2 no: 1955, yes: 45
discount 0 1 FALSE 2 yes: 1786, no: 214

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
age 0 1 40.54 9.12 21 33 39 47 61 ▂▇▆▅▃
days_member 0 1 262.40 249.35 0 124 197 310 2462 ▇▁▁▁▁
movies_seen 0 1 1.88 1.16 1 1 2 2 9 ▇▂▁▁▁
amount_spent 0 1 517.18 980.58 0 66 216 510 10350 ▇▁▁▁▁
pairs.panels(purchases) 

hist(log(purchases$amount_spent))

ggplot(purchases, aes(job, amount_spent)) + 
  geom_boxplot() + 
  scale_y_continuous(labels=dollar)

ggplot(purchases, aes(education, amount_spent)) + 
  geom_boxplot() + 
  scale_y_continuous(labels=dollar)

ggplot(purchases, aes(factor(movies_seen), amount_spent)) + 
  geom_boxplot() + 
  scale_y_continuous(labels=dollar)

Linear Regression

The linear regression model has an R-squared value of 0.02532, which is far too low to base judgements on about whether variables are correlated with the amount spent. Two variables are significant: jobwait-staff and educationtertiary.

set.seed(123)
model.lm.fullset <- lm(amount_spent ~., purchases)
summary(model.lm.fullset) 
## 
## Call:
## lm(formula = amount_spent ~ ., data = purchases)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -1131.3  -410.4  -269.8    22.3  9575.1 
## 
## Coefficients:
##                      Estimate Std. Error t value Pr(>|t|)   
## (Intercept)         243.68847  180.06384   1.353  0.17610   
## age                   4.78582    2.75768   1.735  0.08282 . 
## jobblue-collar      -19.03636   76.06472  -0.250  0.80241   
## jobentrepreneur     135.54936  147.53145   0.919  0.35832   
## jobmanagement       105.30974  101.24757   1.040  0.29841   
## jobretired         -130.82587  144.88928  -0.903  0.36667   
## jobself-employed    -19.70345  157.38334  -0.125  0.90038   
## jobservices           0.06220   87.98871   0.001  0.99944   
## jobstudent          -92.11807  200.99827  -0.458  0.64679   
## jobtechnician       -14.76209   85.46043  -0.173  0.86288   
## jobunemployed       246.57788  154.07094   1.600  0.10967   
## jobunknown           33.40031  408.14602   0.082  0.93479   
## jobwait-staff       590.69694  223.53039   2.643  0.00829 **
## maritalmarried      -43.62729   66.70185  -0.654  0.51315   
## maritalsingle       -61.88122   78.09725  -0.792  0.42825   
## educationsecondary   54.10943   62.70206   0.863  0.38826   
## educationtertiary   264.07703   92.07081   2.868  0.00417 **
## educationunknown     63.18623  103.52489   0.610  0.54170   
## seen_aloneyes      -249.44346  147.38601  -1.692  0.09072 . 
## discountyes          10.38529   71.48502   0.145  0.88451   
## days_member           0.11692    0.08768   1.334  0.18251   
## movies_seen          -6.59583   18.92702  -0.348  0.72751   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 973.2 on 1978 degrees of freedom
## Multiple R-squared:  0.02532,    Adjusted R-squared:  0.01497 
## F-statistic: 2.447 on 21 and 1978 DF,  p-value: 0.0002671
# normally distributed residuals?
mean(model.lm.fullset$residuals)
## [1] 3.273445e-14
# homoscedasticity?
plot(model.lm.fullset)

# multicollinearity?
car::vif(model.lm.fullset)
##                 GVIF Df GVIF^(1/(2*Df))
## age         1.335362  1        1.155578
## job         2.659907 11        1.045471
## marital     1.289658  2        1.065660
## education   2.323358  3        1.150852
## seen_alone  1.008850  1        1.004415
## discount    1.031051  1        1.015407
## days_member 1.008786  1        1.004384
## movies_seen 1.011173  1        1.005571
# autocorrelation
lmtest::dwtest(model.lm.fullset)
## 
##  Durbin-Watson test
## 
## data:  model.lm.fullset
## DW = 1.9615, p-value = 0.1933
## alternative hypothesis: true autocorrelation is greater than 0

Modifications to LM

I have a hypothesis that single individuals are hosting groups and spending a far larger amount of money than normal. If true, I think these two groups would require different marketing strategies. To explore this, I model large spenders (over $1,000) separately from small spenders (under $100, roughly what a family of 5 could spend on dinner and drinks at a theater). Separating out these two groups results in two models, both with higher R^2 values, though neither is high enough to feel comfortable basing heavy decisions on.

I also look at the log of amount_spent, since the outcome is highly skewed and bounded at zero.
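A minimal illustration of why the +1 offset used below matters: a zero spend would otherwise map to -Inf (the values here are hypothetical):

```r
# log(x + 1) keeps zero spends finite while compressing the right tail
spend <- c(0, 66, 216, 10350)  # hypothetical spends, including a zero
log(spend)      # first element is -Inf
log(spend + 1)  # all finite
```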

purchases_eng <- purchases %>% 
  mutate(
    isMember = factor(ifelse(days_member ==0, 0,1)),
    movies_seen = factor(movies_seen)
  )

purchases_high <- purchases %>% 
  filter(amount_spent>1000)

purchases_low <- purchases %>% 
  filter(amount_spent<100)

## I was going to look into predicting if someone spent at all, but will reserve for another analysis. 
# purchases_zero <- purchases_eng %>% 
#   mutate(
#     didPurchase = factor(ifelse(amount_spent==0, 0,1)),
#   ) %>% 
#   select(-amount_spent)


model_low <- train(amount_spent ~., purchases_low, method='lm')
summary(model_low) 
## 
## Call:
## lm(formula = .outcome ~ ., data = dat)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -49.716 -27.485  -8.225  26.906  77.980 
## 
## Coefficients:
##                      Estimate Std. Error t value Pr(>|t|)    
## (Intercept)         13.527269  10.075597   1.343 0.179894    
## age                  0.507322   0.155954   3.253 0.001203 ** 
## `jobblue-collar`    -3.935443   4.452023  -0.884 0.377052    
## jobentrepreneur      3.201837   9.476769   0.338 0.735580    
## jobmanagement       -9.953818   6.412432  -1.552 0.121102    
## jobretired         -10.657489   8.151050  -1.307 0.191522    
## `jobself-employed` -13.511950   8.981577  -1.504 0.132979    
## jobservices         -0.074223   5.129879  -0.014 0.988461    
## jobstudent           5.303376  11.038702   0.480 0.631086    
## jobtechnician       -7.513190   4.982852  -1.508 0.132106    
## jobunemployed      -17.157006  10.054110  -1.706 0.088416 .  
## jobunknown          -8.064862  16.893099  -0.477 0.633239    
## `jobwait-staff`    -15.811973  14.914403  -1.060 0.289470    
## maritalmarried      -1.856065   3.861842  -0.481 0.630957    
## maritalsingle        2.497457   4.528032   0.552 0.581450    
## educationsecondary  -6.935446   3.536126  -1.961 0.050284 .  
## educationtertiary   -5.795294   5.578084  -1.039 0.299233    
## educationunknown    -8.871741   5.612771  -1.581 0.114465    
## seen_aloneyes       -3.921053   6.342004  -0.618 0.536623    
## discountyes         11.865644   3.548561   3.344 0.000876 ***
## days_member         -0.004277   0.005832  -0.733 0.463652    
## movies_seen          1.815173   1.048710   1.731 0.083967 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 31.94 on 628 degrees of freedom
## Multiple R-squared:  0.06488,    Adjusted R-squared:  0.03361 
## F-statistic: 2.075 on 21 and 628 DF,  p-value: 0.003374
model_high <- train(amount_spent ~., purchases_high, method='lm')
## Warning in predict.lm(modelFit, newdata): prediction from a rank-deficient fit
## may be misleading

summary(model_high) 
## 
## Call:
## lm(formula = .outcome ~ ., data = dat)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -2801.8  -935.7  -462.6   354.4  7122.0 
## 
## Coefficients:
##                     Estimate Std. Error t value Pr(>|t|)    
## (Intercept)        4449.8649   876.7227   5.076 7.94e-07 ***
## age                 -20.5677    13.0165  -1.580   0.1154    
## `jobblue-collar`   -138.1964   424.6701  -0.325   0.7452    
## jobentrepreneur     536.7240   709.3356   0.757   0.4500    
## jobmanagement        17.9080   476.0889   0.038   0.9700    
## jobretired         1789.1434  1077.9855   1.660   0.0983 .  
## `jobself-employed`  -37.8031   870.1437  -0.043   0.9654    
## jobservices         -42.7880   467.9268  -0.091   0.9272    
## jobstudent          -27.6073  1778.1471  -0.016   0.9876    
## jobtechnician      -174.7023   447.6008  -0.390   0.6967    
## jobunemployed       753.3475   757.3892   0.995   0.3209    
## jobunknown          111.4047  1805.5523   0.062   0.9509    
## `jobwait-staff`    1907.5553   949.1847   2.010   0.0456 *  
## maritalmarried     -758.8439   347.2423  -2.185   0.0299 *  
## maritalsingle      -815.0512   408.8288  -1.994   0.0474 *  
## educationsecondary -224.4673   367.7496  -0.610   0.5422    
## educationtertiary   230.9600   445.8317   0.518   0.6049    
## educationunknown   -256.8584   509.7955  -0.504   0.6148    
## seen_aloneyes       252.2936  1298.9299   0.194   0.8462    
## discountyes        -584.2062   373.6642  -1.563   0.1193    
## days_member           0.2716     0.4145   0.655   0.5130    
## movies_seen           0.6603    93.2239   0.007   0.9944    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1727 on 231 degrees of freedom
## Multiple R-squared:  0.1053, Adjusted R-squared:  0.02392 
## F-statistic: 1.294 on 21 and 231 DF,  p-value: 0.1799
model_log <- train(log(amount_spent+1) ~., purchases, method='lm')
summary(model_log) 
## 
## Call:
## lm(formula = .outcome ~ ., data = dat)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -5.6979 -0.7966  0.3095  1.2386  4.9356 
## 
## Coefficients:
##                      Estimate Std. Error t value Pr(>|t|)    
## (Intercept)         2.9969500  0.3647111   8.217 3.72e-16 ***
## age                 0.0268490  0.0055856   4.807 1.65e-06 ***
## `jobblue-collar`   -0.1080271  0.1540656  -0.701 0.483276    
## jobentrepreneur     0.1480491  0.2988182   0.495 0.620338    
## jobmanagement      -0.1491229  0.2050723  -0.727 0.467206    
## jobretired         -0.6779986  0.2934667  -2.310 0.020974 *  
## `jobself-employed` -0.3279678  0.3187728  -1.029 0.303678    
## jobservices        -0.0406757  0.1782171  -0.228 0.819486    
## jobstudent          0.0058285  0.4071129   0.014 0.988579    
## jobtechnician      -0.2143712  0.1730962  -1.238 0.215696    
## jobunemployed       0.1853215  0.3120637   0.594 0.552675    
## jobunknown         -0.4088848  0.8266812  -0.495 0.620930    
## `jobwait-staff`     0.2097409  0.4527506   0.463 0.643229    
## maritalmarried      0.0604497  0.1351016   0.447 0.654607    
## maritalsingle       0.1085608  0.1581824   0.686 0.492604    
## educationsecondary  0.0851712  0.1270002   0.671 0.502529    
## educationtertiary   0.5189366  0.1864852   2.783 0.005442 ** 
## educationunknown   -0.0870435  0.2096850  -0.415 0.678102    
## seen_aloneyes      -1.0014765  0.2985237  -3.355 0.000809 ***
## discountyes         0.8722627  0.1447896   6.024 2.02e-09 ***
## days_member         0.0002015  0.0001776   1.135 0.256658    
## movies_seen        -0.0030210  0.0383358  -0.079 0.937196    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.971 on 1978 degrees of freedom
## Multiple R-squared:  0.04423,    Adjusted R-squared:  0.03408 
## F-statistic: 4.358 on 21 and 1978 DF,  p-value: 1.679e-10

Predictive Analytics

One-hot and test/train

Splitting the test and train sets in preparation for comparing prediction models.

one.hot <- as.data.frame(model.matrix(~. -1, purchases))
set.seed(345)
trainIndex <- sample(nrow(one.hot), (nrow(one.hot)*.7))
train <- one.hot[trainIndex,]
test <- one.hot[-trainIndex,]
test_outcome <- test[,"amount_spent"]
train_outcome <- train[,"amount_spent"]
test <- test %>% select(-amount_spent)

Reusable compare function

A short function to quickly compare test and train RMSE values across multiple models.

compare_models <- function(test, train, expected, model) {
  predict_test <- predict(model, test)
  predict_train <- predict(model, train)
  
  stats_svm <- as.matrix(rbind(
    mmetric(train$amount_spent, predict_train,c("MAE","MSE","RMSE","RAE")),
    mmetric(expected,predict_test,c("MAE","MSE","RMSE","RAE"))
  ))
  rownames(stats_svm) <- c("Train Set", "Test Set")
  knitr::kable(stats_svm, digits=2, caption = deparse(substitute(model))) %>% 
    kable_styling(bootstrap_options = c("hover"))
}

SVR Tuning

Running a few rounds of grid-search tuning using e1071's tune() method.

set.seed(123)
model.svm.radial <- svm(amount_spent~., data= train, kernel='radial', cost=10, scale=FALSE)
compare_models(test, train, test_outcome, model.svm.radial)
model.svm.radial
MAE MSE RMSE RAE
Train Set 421.22 889747.2 943.26 82.18
Test Set 461.87 1399393.4 1182.96 81.16
set.seed(123)
# perform grid search
tuneResult <- tune(svm, amount_spent~., data= train,
              ranges = list(epsilon = seq(0,1,0.1), cost = 2^(2:9))
)
print(tuneResult)
## 
## Parameter tuning of 'svm':
## 
## - sampling method: 10-fold cross validation 
## 
## - best parameters:
##  epsilon cost
##      0.5    4
## 
## - best performance: 838189.6
# Draw the tuning graph
plot(tuneResult)

set.seed(123)
tuneResult2 <- tune(svm, amount_spent~., data= train,
              ranges = list(epsilon = seq(.42,.55,0.01), cost = seq(2,8,1))
)
print(tuneResult2) 
## 
## Parameter tuning of 'svm':
## 
## - sampling method: 10-fold cross validation 
## 
## - best parameters:
##  epsilon cost
##     0.47    2
## 
## - best performance: 822950.6
plot(tuneResult2)

set.seed(123)
tuneResult3 <- tune(svm, amount_spent~., data= train,
              ranges = list(epsilon = seq(.47,.49,0.005), cost = seq(.1,3.1,.5))
)
print(tuneResult3)
## 
## Parameter tuning of 'svm':
## 
## - sampling method: 10-fold cross validation 
## 
## - best parameters:
##  epsilon cost
##     0.48  0.6
## 
## - best performance: 816483.1
plot(tuneResult3)

set.seed(123)
tuneResult4 <- tune(svm, amount_spent~., data= train,
              ranges = list(epsilon = seq(.485,.51,0.005), cost = seq(.05,.15,.1))
)
print(tuneResult4)
## 
## Parameter tuning of 'svm':
## 
## - sampling method: 10-fold cross validation 
## 
## - best parameters:
##  epsilon cost
##    0.505 0.15
## 
## - best performance: 815765.7
plot(tuneResult4)

set.seed(500)
tuneResult5 <- tune(svm, amount_spent~., data= train,
              ranges = list(epsilon = seq(.5,.52,0.002), cost = seq(.14,.2,.01))
)
print(tuneResult5)
## 
## Parameter tuning of 'svm':
## 
## - sampling method: 10-fold cross validation 
## 
## - best parameters:
##  epsilon cost
##    0.516 0.15
## 
## - best performance: 813626.9
plot(tuneResult5)

model.svm.bestTune <- tuneResult5$best.model
compare_models(test, train, test_outcome, model.svm.bestTune)
model.svm.bestTune
MAE MSE RMSE RAE
Train Set 501.44 794100.7 891.12 97.83
Test Set 554.21 1297557.1 1139.10 97.38
# # perform a grid search
# set.seed(123)
# tuneResult.log <- tune(svm, log(amount_spent+1)~., data= train,
#               ranges = list(epsilon = seq(0,1,0.1), cost = 2^(2:9))
# )
# print(tuneResult.log)
# # Draw the tuning graph
# plot(tuneResult.log)
# 
# set.seed(123)
# tuneResult2.log <- tune(svm, log(amount_spent+1)~., data= train,
#               ranges = list(epsilon = seq(.42,.55,0.01), cost = seq(2,8,1))
# )
# print(tuneResult2.log) 
# plot(tuneResult2.log)
# 
# set.seed(123)
# tuneResult3.log <- tune(svm, log(amount_spent+1)~., data= train,
#               ranges = list(epsilon = seq(.48,.53,0.01), cost = seq(.1,3.1,.5))
# )
# print(tuneResult3.log)
# plot(tuneResult3.log)
# 
# set.seed(123)
# tuneResult4.log <- tune(svm, log(amount_spent+1)~., data= train,
#               ranges = list(epsilon = seq(.495,.51,0.005), cost = seq(.05,.15,.1))
# )
# print(tuneResult4.log)
# plot(tuneResult4.log)
# 
# set.seed(123)
# tuneResult5.log <- tune(svm, log(amount_spent+1)~., data= train,
#               ranges = list(epsilon = seq(.505,.52,0.002), cost = seq(.01,.06,.01))
# )
# print(tuneResult5.log)
# plot(tuneResult5.log)

set.seed(123)
tuneResult6.log <- tune(svm, log(amount_spent+1)~., data= train,
              ranges = list(epsilon = seq(.528,.552,0.003), cost = seq(.01,.03,.01))
)
print(tuneResult6.log)
## 
## Parameter tuning of 'svm':
## 
## - sampling method: 10-fold cross validation 
## 
## - best parameters:
##  epsilon cost
##    0.534 0.02
## 
## - best performance: 4.006164
plot(tuneResult6.log)

model.svm.log.bestTune <- tuneResult6.log$best.model
compare_models(test, train, test_outcome, model.svm.log.bestTune)
model.svm.log.bestTune
MAE MSE RMSE RAE
Train Set 506.75 1070716 1034.75 98.86
Test Set 526.88 1578593 1256.42 92.58

Model Training

## since heavily positive-skewed and bounded at 0, trying log of amount spent
set.seed(123)
model.log <- train(log(amount_spent+1) ~., train, method='lm')
## Warning in predict.lm(modelFit, newdata): prediction from a rank-deficient fit
## may be misleading

set.seed(123)
model.svm.tuned <- svm(amount_spent ~., data=train, epsilon=.51, cost=.16)

set.seed(123)
model.svm.log.tuned <- svm(log(amount_spent+1) ~., data=train, epsilon=.549, cost=.01)

set.seed(123)
model.svm.simple <- svm(amount_spent~., data = train)

set.seed(123)
model.lm <- train(amount_spent ~., train, method='lm')
## Warning in predict.lm(modelFit, newdata): prediction from a rank-deficient fit
## may be misleading


Model Comparison

compare_models(test, train, test_outcome, model.svm.simple)
model.svm.simple
MAE MSE RMSE RAE
Train Set 391.97 801454.8 895.24 76.47
Test Set 476.60 1370485.6 1170.68 83.74
compare_models(test, train, test_outcome, model.svm.tuned)
model.svm.tuned
MAE MSE RMSE RAE
Train Set 498.94 793204.5 890.62 97.34
Test Set 551.77 1297742.5 1139.19 96.95
compare_models(test, train, test_outcome, model.svm.log.tuned)
model.svm.log.tuned
MAE MSE RMSE RAE
Train Set 506.77 1070752 1034.77 98.87
Test Set 526.90 1578623 1256.43 92.58
compare_models(test, train, test_outcome, model.lm)
## Warning in predict.lm(modelFit, newdata): prediction from a rank-deficient fit
## may be misleading

model.lm
MAE MSE RMSE RAE
Train Set 506.60 793190.6 890.61 98.83
Test Set 549.55 1284606.8 1133.40 96.56
compare_models(test, train, test_outcome, model.log)
## Warning in predict.lm(modelFit, newdata): prediction from a rank-deficient fit
## may be misleading

model.log
MAE MSE RMSE RAE
Train Set 506.92 1070943 1034.86 98.90
Test Set 527.05 1578788 1256.50 92.61

Review Text Analysis

## check for null review text
which(!complete.cases(reviews$text))
## integer(0)
reviews <- reviews %>% 
  mutate(
    text = as.character(text),
    valence = factor(ifelse(reviews$star<3, "Negative", "Positive"))
  )
summary(reviews)
##       star          text               valence  
##  Min.   :1.00   Length:75          Negative: 7  
##  1st Qu.:3.00   Class :character   Positive:68  
##  Median :4.00   Mode  :character                
##  Mean   :4.04                                   
##  3rd Qu.:5.00                                   
##  Max.   :5.00
ggplot(reviews, aes(str_length(text), fill=factor(star))) + 
  geom_boxplot() + 
  ggtitle("review length ~ star rating")

Data prep

## create corpus and include star rating for segmenting
reviews.corpus <- corpus(reviews$text)
docvars(reviews.corpus, "star") <- reviews$star
docvars(reviews.corpus, "valence") <- reviews$valence

reviews.dfm <- dfm(reviews.corpus, 
                   remove=stopwords('english'), 
                   remove_punct=TRUE, 
                   remove_symbols=TRUE, 
                   remove_separators=TRUE,
                   )

reviews.dfm.valence <- dfm(reviews.corpus, 
                   remove=stopwords('english'), 
                   remove_punct=TRUE, 
                   remove_symbols=TRUE, 
                   remove_separators=TRUE,
                   groups = 'valence'
                   )

reviews.dfm.star <- dfm(reviews.corpus, 
                   remove=stopwords('english'), 
                   remove_punct=TRUE, 
                   remove_symbols=TRUE, 
                   remove_separators=TRUE,
                   groups = 'star'
                   )

reviews.tfidf <- dfm_tfidf(reviews.dfm)

Word Clouds

set.seed(100)

textplot_wordcloud(reviews.dfm, 
                   min_count=3, 
                   random_order = FALSE,
                   rotation=.25,
                   color = RColorBrewer::brewer.pal(8,"Dark2")
                   )

textplot_wordcloud(reviews.dfm.valence,
                   comparison=TRUE,
                   min_count=3
                   )

textplot_wordcloud(reviews.dfm.star,
                   comparison=TRUE,
                   min_count=3
                   )

Keyness

reviews.keyness <- textstat_keyness(reviews.dfm, target=reviews$valence=="Positive")
textplot_keyness(reviews.keyness, margin=.1, n=13)

Topic Modeling: Quanteda

I used quanteda and seededlda to practice a different LDA implementation. A knitting error is preventing me from including the evaluated code. The output in RStudio is:

topic1 topic2 topic3
movie time food
great back just
place moviemagic go
food cinema like
fun movies popcorn
good also get
love really beer
can going ordered
pizza good really
theater awesome got
model_lda <- textmodel_lda(reviews.dfm, k=3)
# as.data.frame(terms(model_lda, 10))

Topic Modeling: topicmodels

# perform a Latent Dirichlet Analysis (several lines of code to get you started)
# first remove stop words
corpus <- VCorpus(VectorSource(reviews$text))
# a function to clean /,@,\\,|
toSpace <- content_transformer(function(x, pattern) gsub(pattern, " ", x))
corpus <- tm_map(corpus, toSpace, "/|@|\\|")
corpus<- tm_map(corpus, stripWhitespace) # remove white space
# convert all to lower case, else the same word in lower and upper case will be classified as different
corpus <- tm_map(corpus, content_transformer(tolower))
corpus <- tm_map(corpus, removeNumbers) # remove numbers
corpus <- tm_map(corpus, removePunctuation) # remove punctuation
corpus <- tm_map(corpus, removeWords, stopwords("en"))
dtm <- DocumentTermMatrix(corpus)
set.seed(234)
rowTotals <- apply(dtm , 1, sum)
dtm <- dtm[rowTotals> 0, ]
lda <- LDA(dtm, k = 3, method = "Gibbs", control = NULL)
topics <- tidytext::tidy(lda, matrix = "beta") # beta is the topic-word density
top_terms <- topics %>%
  group_by(topic) %>%
  top_n(10, beta) %>% # keep the 10 highest-beta terms per topic
  ungroup() %>%
  arrange(topic, -beta)
top_terms %>%
  mutate(term = reorder(term, beta)) %>%
  ggplot(aes(term, beta, fill = factor(topic))) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~ topic, scales = "free") +
  coord_flip()

# perplexity calculation - change k = values
lda.def <- LDA(dtm, k = 4, control = NULL) 
perplexity(lda.def)
## [1] 367.3747
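The comment above suggests re-running with different k values. A hedged sketch of that comparison, using the same VEM estimation as lda.def (the seed passed via control and the range 2:6 are assumptions):

```r
# Compare perplexity across candidate topic counts (lower is better,
# but balance fit against the interpretability of the topics).
ks <- 2:6
perp <- sapply(ks, function(k) perplexity(LDA(dtm, k = k, control = list(seed = 234))))
data.frame(k = ks, perplexity = perp)
```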

Questions

  • For questions 1, 2, 3, and 4, the first dataset, project_data, is used.
  • For questions 1, 2, and 3, use the complete dataset (do not split it into train and test sets).
  • For question 4, split the dataset into train and test sets. You can try different splits; indicate what proportion you used when comparing SVR and linear regression.
  • For questions 5 and 6, the second dataset, project_reviews, is used.

Q1-3 Model

summary(model.lm.fullset) 
## 
## Call:
## lm(formula = amount_spent ~ ., data = purchases)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -1131.3  -410.4  -269.8    22.3  9575.1 
## 
## Coefficients:
##                      Estimate Std. Error t value Pr(>|t|)   
## (Intercept)         243.68847  180.06384   1.353  0.17610   
## age                   4.78582    2.75768   1.735  0.08282 . 
## jobblue-collar      -19.03636   76.06472  -0.250  0.80241   
## jobentrepreneur     135.54936  147.53145   0.919  0.35832   
## jobmanagement       105.30974  101.24757   1.040  0.29841   
## jobretired         -130.82587  144.88928  -0.903  0.36667   
## jobself-employed    -19.70345  157.38334  -0.125  0.90038   
## jobservices           0.06220   87.98871   0.001  0.99944   
## jobstudent          -92.11807  200.99827  -0.458  0.64679   
## jobtechnician       -14.76209   85.46043  -0.173  0.86288   
## jobunemployed       246.57788  154.07094   1.600  0.10967   
## jobunknown           33.40031  408.14602   0.082  0.93479   
## jobwait-staff       590.69694  223.53039   2.643  0.00829 **
## maritalmarried      -43.62729   66.70185  -0.654  0.51315   
## maritalsingle       -61.88122   78.09725  -0.792  0.42825   
## educationsecondary   54.10943   62.70206   0.863  0.38826   
## educationtertiary   264.07703   92.07081   2.868  0.00417 **
## educationunknown     63.18623  103.52489   0.610  0.54170   
## seen_aloneyes      -249.44346  147.38601  -1.692  0.09072 . 
## discountyes          10.38529   71.48502   0.145  0.88451   
## days_member           0.11692    0.08768   1.334  0.18251   
## movies_seen          -6.59583   18.92702  -0.348  0.72751   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 973.2 on 1978 degrees of freedom
## Multiple R-squared:  0.02532,    Adjusted R-squared:  0.01497 
## F-statistic: 2.447 on 21 and 1978 DF,  p-value: 0.0002671

Question 1

Of the 8 predictors, which predictors have a significant influence on amount spent on concessions? Running a simple linear regression including all the original variables and no pre-processing yields a model in which 2 coefficients are significantly associated with concession spending (p < .01):

  • having a job as wait-staff
  • having a tertiary education degree

In my opinion, these results should not be acted upon, as the model explains only 2.5% of the variability in the amount spent on concessions. I used the model model.lm.fullset to answer this question.

Question 2

Which predictors have a positive influence and which predictors have a negative influence on the amount spent on concessions? Which analysis, regression or SVR, helped you answer this question?

These variables have a positive coefficient, i.e. a positive influence on the amount spent on concessions:

  • Age (age)
  • Having a job as an entrepreneur, in management or services, unemployed, unknown, or as wait-staff (jobentrepreneur, jobmanagement, jobservices, jobunemployed, jobunknown, jobwait-staff)
  • Secondary, tertiary, or unknown education status (educationsecondary, educationtertiary, educationunknown)
  • Having received a discount coupon (discountyes)
  • Number of days as a member (days_member)

These variables have a negative coefficient, i.e. a negative influence on the amount spent on concessions:

  • Having a blue-collar job, being self-employed or retired, or being a student or technician (jobblue-collar, jobretired, jobself-employed, jobstudent, jobtechnician)
  • Being married or single (maritalmarried, maritalsingle)
  • Seeing the movie alone (seen_aloneyes)
  • The number of movies seen (movies_seen)

Since the intercept is positive, you can say that with all variables held at 0, the baseline state has a positive influence. Since the categorical variables were dummy-coded, this baseline corresponds to:

  • Being divorced
  • Having a primary education
  • Having a job as an admin
  • Not having received a discount coupon
  • Not having seen the movie alone

I used the white-box linear regression model to answer this question; SVR is a black-box model and does not expose coefficient values.
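The positive and negative groupings above can be read directly off the fitted coefficients. A minimal sketch, assuming model.lm.fullset from the Q1-3 model is in the workspace:

```r
# Split the fitted coefficients (excluding the intercept) by sign.
coefs <- coef(model.lm.fullset)[-1]
sort(coefs[coefs > 0], decreasing = TRUE)  # positive influence on amount_spent
sort(coefs[coefs < 0])                     # negative influence on amount_spent
```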

Question 3

Given the significant predictors, what strategies can MovieMagic come up with to increase amount spent on concessions?

  • Create a marketing campaign aimed at those who have been members for a while. This campaign could center around a special incentive, like a discount coupon.
  • Create a marketing campaign geared towards entrepreneurs and management who are hosting large groups of people. You could even look into special catering services if this proves to be a prolific spending group.
  • Offer a coupon or discount for groups buying at the concession stand.
  • Create various-sized ‘family deals’ geared towards groups of 2-10: pre-determined sets of food + drink combos offered at a light discount.
  • Offer a slight discount for first-time members who buy movie tickets for multiple days up front, with the intention that they spend more at concessions since they are more likely to come back for the 5th movie they pre-paid for.

Q4 Model Comparisons

compare_models(test, train, test_outcome, model.svm.simple)

model.svm.simple
                MAE        MSE     RMSE    RAE
Train Set    391.97   801454.8   895.24  76.47
Test Set     476.60  1370485.6  1170.68  83.74

compare_models(test, train, test_outcome, model.svm.tuned)

model.svm.tuned
                MAE        MSE     RMSE    RAE
Train Set    498.94   793204.5   890.62  97.34
Test Set     551.77  1297742.5  1139.19  96.95

compare_models(test, train, test_outcome, model.svm.log.tuned)

model.svm.log.tuned
                MAE        MSE     RMSE    RAE
Train Set    506.77    1070752  1034.77  98.87
Test Set     526.90    1578623  1256.43  92.58

compare_models(test, train, test_outcome, model.lm)
## Warning in predict.lm(modelFit, newdata): prediction from a rank-deficient fit
## may be misleading

model.lm
                MAE        MSE     RMSE    RAE
Train Set    506.60   793190.6   890.61  98.83
Test Set     549.55  1284606.8  1133.40  96.56

compare_models(test, train, test_outcome, model.log)
## Warning in predict.lm(modelFit, newdata): prediction from a rank-deficient fit
## may be misleading

model.log
                MAE        MSE     RMSE    RAE
Train Set    506.92    1070943  1034.86  98.90
Test Set     527.05    1578788  1256.50  92.61
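compare_models() is a helper defined earlier in the project; its definition is not shown in this section. For reference, a hypothetical reimplementation consistent with the tables above (the signature, the amount_spent column name, and the RAE-as-percentage convention are all assumptions):

```r
# Sketch of a compare_models()-style helper: MAE, MSE, RMSE, and
# relative absolute error (RAE, %) on the train and test sets.
compare_models <- function(test, train, test_outcome, model) {
  metrics <- function(actual, predicted) {
    err <- actual - predicted
    c(MAE  = mean(abs(err)),
      MSE  = mean(err^2),
      RMSE = sqrt(mean(err^2)),
      RAE  = 100 * sum(abs(err)) / sum(abs(actual - mean(actual))))
  }
  round(rbind(`Train Set` = metrics(train$amount_spent, predict(model, train)),
              `Test Set`  = metrics(test_outcome, predict(model, test))), 2)
}
```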

Question 4

Which analysis, linear regression or SVR, provides better prediction? Which metric would you focus on to support your answer?

I looked at both MAE and RMSE when comparing models. The simple SVR model, model.svm.simple, which used the default epsilon and cost, had the lowest MAE of all models: $476.60. The linear regression, model.lm, however, had the lowest RMSE: $1133.40. What this tells me is that model.svm.simple had greater variance in its errors: when it was wrong, it was wrong by more than model.lm. Which metric (and hence which model) to prefer depends on the application of the prediction and how undesirable large individual errors are.

While not shown here, RMSLE might be a better metric for comparing models since it penalizes relative rather than absolute error, which suits the skewed distribution of amount spent: if we predict someone will spend $50 when they spent $5, that is a different situation than predicting $3050 when they actually spent $3000. RMSLE might be more appropriate if we are looking at the whole range of spending.
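A minimal RMSLE sketch illustrating the point (log1p() keeps zero spending finite; the example values are hypothetical):

```r
# Root mean squared log error: penalizes relative, not absolute, error.
rmsle <- function(actual, predicted) {
  sqrt(mean((log1p(predicted) - log1p(actual))^2))
}
rmsle(5, 50)       # large relative miss on a small spender -> large RMSLE
rmsle(3000, 3050)  # similar absolute miss on a big spender -> near zero
```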


Q5 Wordcloud

textplot_wordcloud(reviews.dfm.valence,
                   comparison=TRUE,
                   min_count=3
                   )

textplot_wordcloud(reviews.dfm.star,
                   comparison=TRUE,
                   min_count=3
                   )

textplot_keyness(reviews.keyness, margin=.1, n=13)

Question 5

MovieMagic wants to visualize the reviews through a wordcloud and wants to find out which words are used most commonly in the reviews that customers write for MovieMagic. Create 2 wordclouds - one for reviews that received 3, 4, or 5 star ratings and another with reviews that received 1 or 2 stars ratings. Knowing the prominent words in each of the wordclouds, what strategies can be developed in messaging customers? Would the strategies differ?

It appears that waiting for food may be one factor in a customer’s review of the theatre. I would suggest measuring the duration between order submission and food delivery; this might help determine how to reduce poor reviews around concession purchases. Such an analysis might focus on whether certain items take longer to prepare, potentially causing people to miss their movie. Other factors might also influence these poor experiences, such as the number of employees working concessions at the time, time of day, number of movies starting within 15 minutes, or availability of certain items.

Popcorn was used in positive reviews. You might explore whether a separate popcorn-only line would increase these particular sales. I would suggest checking whether sales of other items decrease when the popcorn-only line is available: while it may gain sales from those who would otherwise be deterred by long lines, it may also decrease sales of items that would have otherwise been purchased concurrently.


Q6 Topic Models

### from quanteda package
# as.data.frame(terms(model_lda, 10))


### from topic models pkg
top_terms %>%
  mutate(term = reorder(term, beta)) %>%
  ggplot(aes(term, beta, fill = factor(topic))) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~ topic, scales = "free") +
  coord_flip()

Additional SeededLDA topic output:

topic1     topic2       topic3
movie      time         food
great      back         just
place      moviemagic   go
food       cinema       like
fun        movies       popcorn
good       also         get
love       really       beer
can        going        ordered
pizza      good         really
theater    awesome      got

Question 6

MovieMagic also wants to use topic modeling to find out whether the content in the reviews could be categorized into specific topics. If you used LDA to create 3 topic groups (k = 3) what would be the titles you would assign to these 3 topics. MovieMagic wants you to use the words within the topic to infer topic title. Given the topics you inferred what strategies would you suggest are possible for MovieMagic if it wants to increase concession sales. Would you recommend promotions or advertising or loyalty program; justify your choice of strategy?

Topic 1 - Classic Date Night

Need to get back to basics with your beau? We have a full menu for you and yours to indulge in while snuggled up in front of the big screen. Suggestions: When the alternative for date night involves finding and paying for both dinner and the movie, and rushing through traffic to arrive on time, make it an easy choice for couples to dine at the theatre instead. This might include offering higher-end meals paired with wines, or a quieter, more upscale seating area where kids are not allowed, letting couples reconnect before the movie. Everyone loves a good wine pairing.

Topic 2 - The Magic of Movies -or- Wholesome Family Tradition

Grab some popcorn and snacks and enjoy a fun flick with the whole family. Suggestions: Make it easy for parents bringing a kid or two, or even their 4 or 5 friends! Put together ‘Family Bundles’ with the most frequently co-purchased items catering to families, or easy low-choice selections: each kid chooses 1 snack and 1 drink from a list. Take the stress out of coordinating food selections.

Topic 3 - Kick Back

Powerful sound system, full-range high-density graphics, and comfortable recliners for enjoying some cold beer and hot pizza with friends: this sounds like a place to relieve some stress. If you haven’t yet, look to expand your beer offerings. If the selection is better than competitors’, you might be able to charge more simply for having it available. Offer a discount when beer is purchased with food, especially in bulk: a free 6th beer when bought with 2 pizzas, for example.